Abstract

We consider large-scale networks with n nodes, out of
which k are in possession of (e.g., have sensed or collected in
some other way) k information packets. In scenarios in
which network nodes are vulnerable because of, for example,
limited energy or a hostile environment, it is desirable
to disseminate the acquired information throughout the network
so that each of the n nodes stores one (possibly coded)
packet and the original k source packets can be recovered
later in a computationally simple way from any (1 + ε)k
nodes for some small ε > 0.

We developed two distributed algorithms for solving 
this problem based on simple random walks and Fountain 
codes. Unlike all previously developed schemes, our solu- 
tion is truly distributed, that is, nodes do not know n, k 
or connectivity in the network, except in their own neigh- 
borhoods, and they do not maintain any routing tables. In 
the first algorithm, all the sensors have the knowledge of 
n and k. In the second algorithm, each sensor estimates 
these parameters through the random walk dissemination. 
We present analysis of the communication/transmission and 
encoding/decoding complexity of these two algorithms, and 
provide extensive simulation results as well.



1 Introduction 

Wireless sensor networks consist of small devices (sen- 
sors) with limited resources (e.g., low CPU power, small 
bandwidth, limited battery and memory). They can be 
deployed to monitor objects, measure temperature, detect 
fires, and other disaster phenomena. They are often used in 
isolated, hard-to-reach areas, where human involvement is
limited. Consequently, data acquired by sensors may have
a short lifetime, and any processing of it within the network
should have low complexity and power consumption [18].

We consider a large-scale wireless sensor network with
n sensors. Among them, k ≪ n sensors have collected
(sensed) some information. Since sensors are often short-lived
because of limited energy or a hostile environment, it is
desirable to disseminate the acquired information throughout
the network so that each of the n nodes stores one (possibly
coded) packet and the original k source packets can
be recovered in a computationally simple way from any
(1 + ε)k nodes for some small ε > 0. Here, the sensors
do not know the locations of each other, and they do not
maintain any routing tables.

Various solutions to the centralized version of this problem
have been proposed, and are based on well-known
coding schemes such as Fountain codes [6] or MDS
codes [16]. To distribute the information from multiple
sources throughout the network so that each node stores
a coded packet as if obtained by centralized LT (Luby
Transform) coding [12], Lin et al. [11] proposed a solution
that uses random walks with traps. To achieve the desired
code degree distribution, they employed the Metropolis
algorithm to specify the transition probabilities of the random
walks. In this way, the original k source packets are
encoded by LT codes and the decoding process can be done
by querying any (1 + ε)k sensors. Because of the properties
of LT codes, the encoding and decoding complexity
are nearly linear in k, which keeps energy consumption low.

In the methods of [11], knowledge of the total number
of sensors n and sources k is required for calculating the
number of random walks that each source needs to initiate
and for calculating the probability of trapping at each sensor.
Another type of global information, namely, the maximum
node degree (i.e., the maximum number of neighbors)
in the network, is also required to perform the Metropolis
algorithm. However, for a large-scale sensor network, such
global information may not be easy for each individual
sensor to obtain, especially when the topology may change.
Moreover, the algorithms proposed in [11] assume
that each sensor encodes only after receiving enough
source packets. This requires each sensor to maintain a
sufficiently large temporary memory buffer, which may not be
practical in real sensor networks.

In this paper, we propose two new algorithms to solve 
the distributed storage problem in large-scale sensor net- 
works. We refer to these algorithms as LT-Codes based 
Distributed Storage-I (LTCDS-I) and LT-Codes based Dis- 
tributed Storage-II (LTCDS-II). Both algorithms use sim- 
ple random walks without trapping to disseminate source 
packets. In contrast to the methods in [11], both algorithms
demand little global information and memory at each sen- 
sor. In LTCDS-I, only the values of n and k are needed, 
whereas the maximum node degree, which is more difficult 
to obtain, is not required. In LTCDS-II, no sensor needs to 
know any global information (that is, knowing n and k is 
no longer required). Instead, sensors can obtain good es- 
timates for those parameters by using some properties of 
random walks. Moreover, in both algorithms, instead of 
waiting until all the necessary source packets are collected 
to do encoding, each sensor makes decisions and performs 
encoding online upon each reception of source packets.
This mechanism reduces the memory demand significantly. 

The main contributions of this paper are as follows: 

(i) We propose two new algorithms (LTCDS-I and 
LTCDS-II) for distributed storage in large-scale sen- 
sor networks, using simple random walks and LT 
codes. These algorithms are simpler, more robust, and 
less constrained in comparison to previous solutions. 

(ii) We present complexity analysis of both algorithms, 
including transmission, encoding, and decoding com- 
plexity. 

(iii) We evaluate and illustrate the performance of both al- 
gorithms by extensive simulation. 

This paper is organized as follows. We start with a short
survey of the related work in Section 2. In Section 3, we
introduce the network model and present Luby Transform
(LT) codes. In Section 4, we propose two LT-codes based
distributed storage algorithms called LTCDS-I and LTCDS-II.
We then present simulation studies and provide performance
analysis of the proposed algorithms in Section 5, and
conclude in Section 6.

2 Related Work 

The work most closely related to the one presented here is [11, 10].
Lin et al. studied the question "how to retrieve historical
data that the sensors have gathered even if some sensors
are destroyed or disappear from the network?" They analyzed
techniques to increase persistence of sensed data in a
random wireless sensor network, and proposed two decentralized
algorithms using Fountain codes to guarantee the
persistence and reliability of cached data on unreliable sensors.
They used random walks to disseminate data from
multiple sensors (sources) to the whole network. Based on
the knowledge of the total number of sensors n and sources
k, each source calculates the number of random walks it
needs to initiate, and each sensor calculates the number of
source packets it needs to trap. In order to achieve a desired
packet distribution, the transition probabilities of the random
walks are specified by the well-known Metropolis
algorithm [11].

Dimakis et al. [4, 6] proposed a decentralized implementation
of Fountain codes that uses geographic routing,
where every node has to know its location. The motivation 
for using Fountain codes is their low decoding complexity. 
Also, one does not know in advance the degrees of the out- 
put nodes in this type of codes. The authors proposed a 
randomized algorithm that constructs Fountain codes over a 
grid network using only geographical knowledge of nodes 
and local randomized decisions. Fast random walks are 
used to disseminate source data to the storage nodes in the 
network. 

Kamara et al. [9, 8] proposed a novel technique called
growth codes to increase data persistence in wireless sensor
networks, that is, to increase the amount of information
that can be recovered at the sink. Growth coding is a linear
technique in which information is encoded in an online,
distributed way with increasing degree of a storage node.
Kamara et al. showed that growth codes can increase the
amount of information that can be recovered at any storage
node at any time period whenever there is a failure in
some other nodes. They did not use robust or soliton
distributions, but proposed a new distribution depending on
the network condition to determine the degrees of the storage
nodes. The main features of their work are as follows:
(i) the positions and topology of the nodes are not known;
(ii) they assume a round time of node updates, meaning that
the degree of a symbol increases with the time t, which is
the idea behind growth degrees;
(iii) they provide practical implementations of growth codes
and compare their performance with other codes;
(iv) decoding is done by querying an arbitrary sink: if the
original sensed data has been collected correctly, the process
finishes; otherwise, another sink node is queried.

Lun et al. [13] proposed two decentralized algorithms
to compute the minimum-cost subgraphs for establishing
multicast connections using network coding. They also
extended their work to the problem of minimum-energy
multicast in wireless networks, studied directed point-to-point
multicast, and evaluated the case of elastic rate demand.



3 Wireless Sensor Networks and Fountain 
Codes 

In this section, we introduce our network model and provide
background on Fountain codes and, in particular, one
important class of Fountain codes: LT (Luby Transform)
codes [12].

3.1 Network Model 

Our wireless sensor network consists of n nodes that are
uniformly distributed at random in a region $A = [0, L]^2$ for
L > 1. The density of the network is given by

\[ \lambda = \frac{n}{|A|}, \tag{1} \]

where |A| is the two-dimensional Lebesgue measure (or
area) of A. Each sensor node has an identical communication
radius 1; thus any two nodes can communicate with
each other if and only if their distance is less than or equal to
1. This model is known as a random geometric graph [7, 15].
Among these n nodes, there are k source nodes that have 
information to be disseminated throughout the network for 
storage. These k nodes are uniformly and independently 
distributed at random among the n nodes. Usually, the fraction
of source nodes, i.e., k/n, is not very large (e.g., 10% or
20%).

Note that, although we assume the nodes are uniformly 
distributed at random in a region, our algorithms and results 
do not rely on this assumption. In fact, they can be applied 
for any network topology, for example, regular grids. 

We assume that no node has knowledge about the locations
of other nodes and no routing table is maintained;
consequently, the algorithm proposed in [5] cannot be applied.
Moreover, we assume that each node has limited or
no knowledge of global information, but knows its neighbors.
The limited global information refers to the total numbers
of nodes n and sources k. Any further global information,
for example the maximal number of neighbors in the
network, is not available. Hence, the algorithms proposed
in [11, 10] are not applicable.

Figure 1. The encoding operations of Fountain codes: each output
is obtained by XORing d source blocks chosen uniformly and
independently at random from k source inputs, where d is drawn
according to a probability distribution $\Omega(d)$.

Definition 1. (Node Degree) Consider a graph G =
(V, E), where V and E denote the set of nodes and links,
respectively. Given $u, v \in V$, we say u and v are adjacent
(or u is adjacent to v, and vice versa) if there exists a link
between u and v, i.e., $(u, v) \in E$. In this case, we also
say that u and v are neighbors. Denote by $\mathcal{N}(u)$ the set of
neighbors of a node u. The number of neighbors of a node
u is called the node degree of u, and denoted by $d_n(u)$, i.e.,
$|\mathcal{N}(u)| = d_n(u)$. The mean degree of a graph G is then
given by

\[ \mu = \frac{1}{|V|} \sum_{u \in V} d_n(u), \tag{2} \]

where |V| is the total number of nodes in G.
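The network model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `random_geometric_graph` and `mean_degree`, the seed, and the example values n = 200 and L = 5 are hypothetical and do not come from the paper.

```python
import math
import random


def random_geometric_graph(n, side, radius=1.0, seed=0):
    """Place n nodes uniformly in [0, side]^2 and link nodes within `radius`."""
    rng = random.Random(seed)
    pos = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    neighbors = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if math.dist(pos[u], pos[v]) <= radius:
                neighbors[u].append(v)
                neighbors[v].append(u)
    return pos, neighbors


def mean_degree(neighbors):
    """Mean node degree: (1/|V|) times the sum of d_n(u) over all nodes."""
    return sum(len(nb) for nb in neighbors) / len(neighbors)


# Illustrative values only: n = 200 nodes in a 5 x 5 region, radius 1.
pos, neighbors = random_geometric_graph(n=200, side=5.0)
density = 200 / 5.0 ** 2  # n / |A|
print("density:", density, "mean degree:", mean_degree(neighbors))
```

Such a graph is not guaranteed to be connected; a simulation of the algorithms below would need to check connectivity or work on the largest connected component.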

3.2 Fountain Codes 

For k source blocks $\{x_1, x_2, \ldots, x_k\}$ and a probability
distribution $\Omega(d)$ with $1 \le d \le k$, a Fountain code
with parameters $(k, \Omega)$ is a potentially limitless stream of
output blocks $\{y_1, y_2, \ldots\}$. Each output block is obtained
by XORing d randomly and independently chosen source
blocks, where d is drawn from a specially designed distribution
$\Omega(d)$. This is illustrated in Figure 1. Fountain codes
are rateless, and one of their main advantages is that the
encoding operations can be performed online. The encoding
cost is the expected number of operations sufficient for generating
an output symbol, and the decoding cost is the expected
number of operations sufficient to recover the k input
blocks. Another advantage of Fountain codes, as opposed
to purely random codes, is that their decoding complexity
can be made low by an appropriate choice of $\Omega(d)$, with little
sacrifice in performance. The decoding of Fountain codes
can be done by message passing.

Definition 2. (Code Degree) For Fountain codes, the number
of source blocks used to generate an encoded output y
is called the code degree of y, and denoted by $d_c(y)$. By
construction, the code degree distribution $\Omega(d)$ is the
probability distribution of $d_c(y)$.
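The encoding and message-passing decoding described above can be illustrated with a minimal Python sketch. It makes several assumptions that are not in the paper: blocks are modeled as integers, the d source blocks of an output are chosen as distinct indices, the decoder knows which indices each output covers, and the names `encode_block` and `peel_decode` are hypothetical.

```python
import random


def encode_block(source, degree_weights, rng):
    """Return (indices, value): the XOR of d distinct source blocks.

    degree_weights[d - 1] is the (unnormalized) probability of degree d.
    """
    k = len(source)
    d = rng.choices(range(1, k + 1), weights=degree_weights)[0]
    idx = rng.sample(range(k), d)
    value = 0
    for i in idx:
        value ^= source[i]
    return set(idx), value


def peel_decode(k, outputs):
    """Peeling (message-passing) decoder; returns {index: recovered block}."""
    pending = [[set(idx), val] for idx, val in outputs]
    recovered = {}
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for item in pending:
            idx = item[0]
            # Remove the contribution of blocks that are already known.
            for i in [i for i in idx if i in recovered]:
                idx.discard(i)
                item[1] ^= recovered[i]
            # An output covering exactly one unknown block reveals that block.
            if len(idx) == 1:
                (i,) = idx
                recovered[i] = item[1]
                idx.clear()
                progress = True
    return recovered


source = [5, 9, 12]
# Hand-built outputs of degree 1, 2 and 2, so the peeling order is easy to follow.
outputs = [({0}, 5), ({0, 1}, 5 ^ 9), ({1, 2}, 9 ^ 12)]
print(peel_decode(3, outputs))

rng = random.Random(1)
print(encode_block(source, [1, 1, 1], rng))
```

Decoding stalls whenever no remaining output covers exactly one unknown block, which is why the choice of degree distribution matters.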

3.3 LT Codes 

LT (Luby Transform) codes are a special class of Fountain
codes that use the Ideal Soliton or Robust Soliton
distributions [12]. The Ideal Soliton distribution $\Omega_{is}(d)$ for k
source blocks is given by

\[
\Omega_{is}(d) =
\begin{cases}
\frac{1}{k}, & d = 1, \\
\frac{1}{d(d-1)}, & d = 2, 3, \ldots, k.
\end{cases}
\tag{3}
\]

Let $R = c_0 \sqrt{k} \ln(k/\delta)$, where $c_0$ is a suitable constant and
$0 < \delta < 1$. The Robust Soliton distribution for k source
blocks is defined as follows. Define

\[
\tau(d) =
\begin{cases}
\frac{R}{dk}, & d = 1, \ldots, \frac{k}{R} - 1, \\
\frac{R \ln(R/\delta)}{k}, & d = \frac{k}{R}, \\
0, & d = \frac{k}{R} + 1, \ldots, k,
\end{cases}
\]

and let $\beta = \sum_{d=1}^{k} \left( \tau(d) + \Omega_{is}(d) \right)$. The Robust Soliton distribution is given by

\[
\Omega_{rs}(d) = \frac{\tau(d) + \Omega_{is}(d)}{\beta}, \qquad d = 1, \ldots, k.
\]

The following result provides the performance of the LT
codes with Robust Soliton distribution [12, Theorems 12
and 13].

Lemma 3 (Luby [12]). For LT codes with Robust Soliton
distribution, k original source blocks can be recovered from
any $k + O(\sqrt{k} \ln^2(k/\delta))$ encoded output blocks with
probability $1 - \delta$. Both encoding and decoding complexity are
$O(k \ln(k/\delta))$.
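As a rough illustration, the two Soliton distributions can be tabulated as follows. The function names are hypothetical, and rounding k/R to the nearest integer degree is an assumption of this sketch (the definition treats k/R as if it were an integer). The parameter values k = 40, c_0 = 0.1 and δ = 0.5 are the ones quoted in the caption of Figure 2.

```python
import math


def ideal_soliton(k):
    """Return [Omega_is(1), ..., Omega_is(k)]."""
    return [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]


def robust_soliton(k, c0=0.1, delta=0.5):
    """Return [Omega_rs(1), ..., Omega_rs(k)] with R = c0 * sqrt(k) * ln(k / delta)."""
    R = c0 * math.sqrt(k) * math.log(k / delta)
    pivot = int(round(k / R))  # assumption: k/R rounded to an integer degree
    tau = []
    for d in range(1, k + 1):
        if d < pivot:
            tau.append(R / (d * k))
        elif d == pivot:
            tau.append(R * math.log(R / delta) / k)
        else:
            tau.append(0.0)
    rho = ideal_soliton(k)
    beta = sum(r + t for r, t in zip(rho, tau))  # normalizing constant
    return [(r + t) / beta for r, t in zip(rho, tau)]


omega_is = ideal_soliton(40)
omega_rs = robust_soliton(40, c0=0.1, delta=0.5)
print(sum(omega_is), sum(omega_rs))
print(omega_is[:3], omega_rs[:3])
```

Both lists sum to one up to floating-point error, so they can be passed directly as sampling weights for the code degree.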

4 LT-Codes Based Distributed Storage
(LTCDS) Algorithms

In this section, we present two LT-Codes based Distributed
Storage (LTCDS) algorithms. In both algorithms,
the source packets are disseminated throughout the network
by a simple random walk. In the first one, called the LTCDS-I
algorithm, we assume that each node in the network has
limited global information, that is, it knows the total number
of sources k and the total number of nodes n. Unlike
the scheme proposed in [10], our algorithm does not require
the nodes to know the maximum degree of the graph,
which is much harder to obtain than k and n. The second
algorithm, called LTCDS-II, is a fully distributed algorithm
which does not require nodes to know any global information.
The price we pay for this benefit is extra transmissions
of the source packets to obtain estimates for n and k.

4.1 With Limited Global Information — 
LTCDS-I 

In LTCDS-I, we assume that each node in the network
knows the values of k and n. We use simple random
walks [1, 17] for each source to disseminate its information
to the whole network. At each round, each node u that has
packets to transmit chooses one node v among its neighbors
uniformly and independently at random, and sends the packet
to the node v. In order to avoid the local-cluster effect (each
source packet being trapped most likely by its neighbor nodes),
we let each node accept a source packet equiprobably. To
achieve this, we also need each source packet to visit each
node in the network at least once.

For a random walk on a graph, the cover time is defined
as follows [1, 17]:

Definition 4. (Cover Time) Given a graph G, let $T_{cover}(u)$
be the expected length of a random walk that starts at node
u and visits every node in G at least once. The cover time
of G is defined by

\[ T_{cover}(G) = \max_{u \in G} T_{cover}(u). \tag{7} \]

For a simple random walk on a random geometric graph,
the following result bounds the cover time [3].

Lemma 5 (Avin and Ercal [3]). If a random geometric
graph with n nodes is a connected graph with high probability,
then

\[ T_{cover}(G) = \Theta(n \log n). \tag{8} \]
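A small simulation can give a feel for the cover time of a simple random walk. The sketch below is only illustrative: it uses a torus grid as a stand-in topology instead of a random geometric graph, so Lemma 5 does not apply to it directly, and the helper names `torus_neighbors` and `cover_steps` are hypothetical.

```python
import math
import random


def torus_neighbors(side):
    """Adjacency lists of a side x side torus grid (a stand-in topology)."""
    def node(r, c):
        return (r % side) * side + (c % side)
    return [[node(r - 1, c), node(r + 1, c), node(r, c - 1), node(r, c + 1)]
            for r in range(side) for c in range(side)]


def cover_steps(neighbors, start, rng):
    """Steps a simple random walk from `start` needs to visit every node."""
    seen = {start}
    current, steps = start, 0
    while len(seen) < len(neighbors):
        current = rng.choice(neighbors[current])
        seen.add(current)
        steps += 1
    return steps


rng = random.Random(0)
neighbors = torus_neighbors(6)
n = len(neighbors)
trials = [cover_steps(neighbors, 0, rng) for _ in range(20)]
average = sum(trials) / len(trials)
# Compare the empirical average with n log n to get a rough idea of the constant.
print("average cover steps:", average, "n log n:", n * math.log(n))
```

The ratio of the two printed numbers suggests an order of magnitude for a hop-count threshold on this particular topology.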

As a result of Lemma 5, we can set a counter for each
source packet and increase the counter by one after each
forward transmission until the counter reaches some threshold
$C_1 n \log n$, so that, for a suitable constant $C_1$, the source
packet visits each node in the network at least once with high
probability. The detailed descriptions
of the initialization, encoding and storage phases (steps) of
the LTCDS-I algorithm are given below:

(i) Initialization Phase:

(1) Each node u in the network draws a random number
$d_c(u)$ according to the distribution $\Omega_{is}(d)$
given by (3) (or the Robust Soliton distribution $\Omega_{rs}(d)$). Each
source node $s_i$, i = 1, ..., k, generates a header
for its source packet $x_{s_i}$ and puts its ID and a
counter $c(x_{s_i})$ with initial value zero into the
packet header. We set up tokens for initial and update
packets. We assume that a token is set to zero
for an initial packet and 1 for an update packet.

\[ packet_{s_i} = (ID_{s_i}, x_{s_i}, c(x_{s_i})) \]

(2) Each source node $s_i$ sends out its own source
packet $x_{s_i}$ to another node u which is chosen uniformly
at random among all its neighbors $\mathcal{N}(s_i)$.

(3) The chosen node u accepts this source packet $x_{s_i}$
with probability $d_c(u)/k$ and updates its storage as

\[ y_u^+ = y_u^- \oplus x_{s_i}, \tag{9} \]

where $y_u^-$ and $y_u^+$ denote the packet that the node
u stores before and after the updating, respectively,
and $\oplus$ represents the XOR operation. Regardless
of whether the source packet is accepted,
the node u puts it into its forward queue and sets
the counter of $x_{s_i}$ to

\[ c(x_{s_i}) = 1. \tag{10} \]

(ii) Encoding Phase:

(1) In each round, a node u that has received at least
one source packet before the current round forwards
the head-of-line (HOL) packet x in its forward
queue to one of its neighbors v, chosen uniformly
at random among all its neighbors $\mathcal{N}(u)$.

(2) Depending on how many times x has visited v, the
node v makes its decision:

• If it is the first time that x visits v, then the node
v accepts this source packet with probability $d_c(v)/k$
and updates its storage as

\[ y_v^+ = y_v^- \oplus x. \tag{11} \]

• If x has visited v before and $c(x) < C_1 n \log n$,
where $C_1$ is a system parameter, then the node
v accepts this source packet with probability 0.

• In both cases above, regardless of whether x is accepted,
the node v puts it into its forward queue and increases the
counter of x by one:

\[ c(x) = c(x) + 1. \tag{12} \]

• If x has visited v before and $c(x) \ge C_1 n \log n$,
then the node v discards the packet x forever.

(iii) Storage Phase:

When a node u has made its decisions for all the source
packets $x_{s_1}, x_{s_2}, \ldots, x_{s_k}$, i.e., all these packets have
visited the node u at least once, the node u finishes
its encoding process by declaring the current $y_u$ to be
its storage packet.

The pseudo-code of these steps is given in Algorithm 1.
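The three phases above can be sketched as a short simulation. This is a simplified illustration, not the authors' implementation: walks are simulated one source at a time instead of in synchronized rounds with forward queues, payloads are integers, a torus grid stands in for the sensor network, and the names `ltcds1`, `torus_neighbors`, and the default `c1=3.0` are hypothetical.

```python
import math
import random


def torus_neighbors(side):
    """Adjacency lists of a side x side torus grid (a stand-in topology)."""
    def node(r, c):
        return (r % side) * side + (c % side)
    return [[node(r - 1, c), node(r + 1, c), node(r, c - 1), node(r, c + 1)]
            for r in range(side) for c in range(side)]


def ltcds1(neighbors, sources, degree_weights, c1=3.0, seed=0):
    """Simulate LTCDS-I on a connected graph.

    sources maps a source node ID to its integer payload; degree_weights[d - 1]
    is the weight of code degree d. Returns (stored, accepted, d_c).
    """
    rng = random.Random(seed)
    n, k = len(neighbors), len(sources)
    limit = c1 * n * math.log(n)  # hop threshold C_1 n log n
    d_c = [rng.choices(range(1, k + 1), weights=degree_weights)[0]
           for _ in range(n)]
    stored = [0] * n
    accepted = [set() for _ in range(n)]
    visited = [set() for _ in range(n)]
    for s, payload in sources.items():
        node, hops = s, 0
        while True:
            node = rng.choice(neighbors[node])  # forward to a uniform neighbor
            hops += 1
            if s not in visited[node]:
                # First visit: accept with probability d_c(v) / k.
                visited[node].add(s)
                if rng.random() <= d_c[node] / k:
                    stored[node] ^= payload
                    accepted[node].add(s)
            elif hops >= limit:
                break  # revisit after the threshold: discard the packet
    return stored, accepted, d_c


neighbors = torus_neighbors(5)
sources = {0: 0x11, 7: 0x22, 13: 0x44, 21: 0x88}
uniform = [1.0] * len(sources)  # placeholder weights, not a Soliton distribution
stored, accepted, d_c = ltcds1(neighbors, sources, uniform)
print([len(a) for a in accepted])
```

The printed list shows the resulting code degree of every node; a node can end up with degree zero, since each acceptance is an independent coin flip.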

The following theorem establishes the code degree dis- 
tribution of each storage node induced by the LTCDS-I al- 
gorithm. 

Theorem 6. When a sensor network with n nodes and k
sources finishes the storage phase of the LTCDS-I algorithm,
the code degree distribution of each storage node u
is given by

\[
\Pr(\tilde{d}_c(u) = i) = \sum_{d_c(u)=1}^{k} \binom{k}{i} \left( \frac{d_c(u)}{k} \right)^{i} \left( 1 - \frac{d_c(u)}{k} \right)^{k-i} \Omega'(d_c(u)), \tag{13}
\]

where $d_c(u)$ is given in the initialization phase of the
LTCDS-I algorithm from the distribution $\Omega'(d)$ (i.e., $\Omega_{is}(d)$ or
$\Omega_{rs}(d)$), and $\tilde{d}_c(u)$ is the code degree of the node u resulting
from the algorithm.



Input: number of nodes n, number of sources k,
       source packets x_{s_i}, i = 1, 2, ..., k, and a
       positive constant C_1
Output: storage packets y_i, i = 1, 2, ..., n

foreach node u = 1 : n do
    Generate d_c(u) according to Ω_is(d) (or Ω_rs(d));
end

foreach source node s_i, i = 1 : k do
    Generate header of x_{s_i} and token = 0;
    c(x_{s_i}) = 0;
    Choose u ∈ N(s_i) uniformly at random, send x_{s_i} to u;
    coin = rand(1);
    if coin ≤ d_c(u)/k then y_u = y_u ⊕ x_{s_i};
    Put x_{s_i} into u's forward queue;
    c(x_{s_i}) = c(x_{s_i}) + 1;
end

while source packets remaining do
    foreach node u that received packets before the current round do
        Choose v ∈ N(u) uniformly at random;
        Send HOL packet x_{s_i} in u's forward queue to v;
        if v receives x_{s_i} for the first time then
            coin = rand(1);
            if coin ≤ d_c(v)/k then
                y_v = y_v ⊕ x_{s_i};
            end
            Put x_{s_i} into v's forward queue;
            c(x_{s_i}) = c(x_{s_i}) + 1;
        else if c(x_{s_i}) < C_1 n log n then
            Put x_{s_i} into v's forward queue;
            c(x_{s_i}) = c(x_{s_i}) + 1;
        else
            Discard x_{s_i};
        end
    end
end

Algorithm 1: LTCDS-I Algorithm: LT-Codes based Distributed
Storage Algorithm for a wireless sensor network
(WSN) with limited global information, i.e., the values of n
and k are known at every node. It consists of three phases:
initialization, encoding and storage. The algorithm
can also be deployed in a WSN after estimating the values of
n and k, as shown in the LTCDS-II algorithm.



Proof. For each node u, $d_c(u)$ is drawn from a distribution
$\Omega'(d)$ (i.e., $\Omega_{is}(d)$ or $\Omega_{rs}(d)$). Given $d_c(u)$, the node u
accepts each source packet with probability $d_c(u)/k$,
independently of the other packets. Thus, the number of source
packets that the node u accepts follows a Binomial distribution
with parameter $d_c(u)/k$. Hence,

\[
\begin{aligned}
\Pr(\tilde{d}_c(u) = i)
&= \sum_{d_c(u)=1}^{k} \Pr(\tilde{d}_c(u) = i \mid d_c(u)) \, \Omega'(d_c(u)) \\
&= \sum_{d_c(u)=1}^{k} \binom{k}{i} \left( \frac{d_c(u)}{k} \right)^{i} \left( 1 - \frac{d_c(u)}{k} \right)^{k-i} \Omega'(d_c(u)),
\end{aligned}
\]

and therefore (13) holds. □

Figure 2. Code degree distribution comparison: (a) Ideal Soliton
distribution $\Omega_{is}$ (given by (3)) and the resulting degree
distribution from the LTCDS-I algorithm (given by (13)). Here
k = 40; (b) Robust Soliton distribution $\Omega_{rs}$ and the resulting
degree distribution from the LTCDS-I algorithm (given by (13)).
Here k = 40, $c_0 = 0.1$ and $\delta = 0.5$.
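The distribution in Theorem 6 can be evaluated numerically and compared with the distribution it is derived from. The sketch below does this for the Ideal Soliton distribution with k = 40 (the value used in Figure 2); the function names are hypothetical, and including the degree i = 0 in the output is a choice of this sketch, reflecting that a node may accept no packet at all.

```python
from math import comb


def ideal_soliton(k):
    """Return [Omega_is(1), ..., Omega_is(k)]."""
    return [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]


def resulting_distribution(k, omega):
    """Pr(resulting degree = i) for i = 0..k, with omega = [Omega'(1), ..., Omega'(k)]."""
    result = []
    for i in range(k + 1):
        total = 0.0
        for d in range(1, k + 1):
            p = d / k  # acceptance probability of a node that drew degree d
            total += comb(k, i) * p ** i * (1 - p) ** (k - i) * omega[d - 1]
        result.append(total)
    return result


k = 40
omega = ideal_soliton(k)
resulting = resulting_distribution(k, omega)
for degree in (1, 2, 3):
    print(degree, round(omega[degree - 1], 4), round(resulting[degree], 4))
```

For each of the first three degrees, the printed line shows the desired probability next to the probability induced by the per-packet coin flips.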

Theorem 6 indicates that the code degree $\tilde{d}_c(u)$ is not the
same as $d_c(u)$. In fact, one may achieve the exact desired
code degree distribution by letting all the sensors hold the
received source packets in their temporary buffer until they
collect all k source packets. Then they can randomly choose
$d_c(u)$ packets. In this way, the resulting degree distribution
is exactly the same as $\Omega_{is}$ or $\Omega_{rs}$. However, this requires
that each sensor has enough buffer or memory, which is usually
not practical, especially when k is large. Therefore, in
LTCDS-I, we assume each sensor has very limited memory
and let it make its decision upon each reception.

Fortunately, from Figure 2 we can see that at the high-degree
end, the resulting code degree distribution obtained
by the LTCDS-I algorithm, given by (13), perfectly matches the
desired code degree distribution, i.e., either the Ideal Soliton
distribution $\Omega_{is}$ or the Robust Soliton distribution $\Omega_{rs}$.
The resulting degree distribution and the desired degree
distributions differ only at the low-degree
end, especially at degree 1 and degree 2. In particular, the
resulting degree distribution has higher probability at degree
1 and lower probability at degree 2 than the desired degree
distributions. The higher probability at degree
1 turns out to compensate for the lower probability at degree 2,
so that the resulting degree distribution has encoding and
decoding behavior very similar to that of LT codes using either the
Ideal Soliton distribution or the Robust Soliton distribution.
In our future study, we will provide theoretical analysis and
prove that the degree distribution in (13) is equivalent, but
not identical, to the degree distribution used in LT encoding [12].
Therefore, we have the following theorem, which
can be proved by the same method as Lemma 3; see [12].

Theorem 7. Suppose a sensor network has n nodes and k
sources and the LTCDS-I algorithm uses the Robust Soliton
distribution $\Omega_{rs}$. Then, when n and k are sufficiently large,
the k original source packets can be recovered from any
$k + O(\sqrt{k} \ln^2(k/\delta))$ storage nodes with probability $1 - \delta$.
The decoding complexity is $O(k \ln(k/\delta))$.

Theorem 7 asserts that when n and k are sufficiently
large, the performance of LTCDS-I is similar to that of LT
coding.

Another main performance metric is the transmission 
cost of the algorithm, which is characterized by the total 
number of transmissions (the total number of steps of k ran- 
dom walks). 

Theorem 8. Denote by $T^{(I)}_{LTCDS}$ the total number of
transmissions of the LTCDS-I algorithm; then we have

\[ T^{(I)}_{LTCDS} = \Theta(k n \log n), \tag{14} \]

where k is the total number of sources, and n is the total
number of nodes in the network.

Proof. We know that each one of the k source packets is
stopped and discarded if and only if it has been forwarded
$C_1 n \log n$ times, for some constant $C_1$. The total
number of transmissions of the LTCDS-I algorithm for all k
packets is a direct consequence, and it is given by (14). □

4.2 Without any Global Information — 
LTCDS II 

In many scenarios, especially when a change in network 
topology occurs because of, for example, node mobility or 
node failures, the exact values of n and k may not be avail- 
able to all nodes. Therefore, to design a fully distributed 
storage algorithm which does not require any global infor- 
mation is very important and useful. In this subsection, 
we present such an algorithm based on LT codes, called 
LTCDS-II. The idea behind this algorithm is to utilize some 
features of simple random walks to do inference to obtain 
individual estimates of n and k for each node. 

We introduce the notions of inter-visit time and inter-packet time [1]
El El as follows: 



Definition 9. (Inter-Visit Time) For a random walk on
a graph, the inter-visit time of node u, $T_{visit}(u)$, is the
amount of time between any two consecutive visits of the
random walk to node u. This inter-visit time is also called
return time.

For a simple random walk on random geometric graphs, 
the following lemma provides results on the expected inter- 
visit time of any node. The proof is straightforward by 
following the standard result of stationary distribution of a 
simple random walk on graphs and the mean return time for 
a Markov chain |fl] [17] Q3) . For completeness, we provide 
the proof in Appendix 6.1. 

Lemma 10. For a node u with node degree $d_n(u)$ in a random
geometric graph, the mean inter-visit time is given by

\[ E[T_{visit}(u)] = \frac{\mu n}{d_n(u)}, \tag{15} \]

where $\mu$ is the mean degree of the graph given by Equation (2).
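For orientation, the standard argument behind Lemma 10 can be written in two steps; this is a sketch of the textbook reasoning for a connected graph, not a reproduction of the proof in the appendix. The stationary distribution $\pi$ of a simple random walk is proportional to the node degree, and the mean return time of a node is the reciprocal of its stationary probability:

```latex
\pi(u) = \frac{d_n(u)}{\sum_{v \in V} d_n(v)} = \frac{d_n(u)}{\mu n},
\qquad
E[T_{visit}(u)] = \frac{1}{\pi(u)} = \frac{\mu n}{d_n(u)}.
```

The second equality in the first expression uses the definition of the mean degree, $\sum_{v \in V} d_n(v) = \mu n$.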

From Lemma 10 we can see that if each node u can
measure the expected inter-visit time $E[T_{visit}(u)]$, then the
total number of nodes n can be estimated by

\[ \hat{n}'(u) = \frac{d_n(u) \, E[T_{visit}(u)]}{\mu}. \tag{16} \]

However, the mean degree $\mu$ is global information and
may be hard to obtain. Thus, we make a further approximation
and let the estimate of n by the node u be

\[ \hat{n}(u) = E[T_{visit}(u)]. \tag{17} \]

Hence, every node u computes its own estimate of n. In
our distributed storage algorithms, each source packet follows
a simple random walk. Since there are k sources, we
have k individual simple random walks in the network. For
a particular random walk, the behavior of the return time is
characterized by Lemma 10. On the other hand, Lemma 12
below provides results on the inter-visit time among all k
random walks, which is called inter-packet time for our
algorithm, defined as follows:

Definition 11. (Inter-Packet Time) For k random walks on
a graph, the inter-packet time of node u, $T_{packet}(u)$, is the
amount of time between any two consecutive visits of those
k random walks to node u.

For the mean value of inter-packet time, we have the fol- 
lowing lemma, for which the proof is given in Appendix 6.2. 

Lemma 12. For a node u with node degree $d_n(u)$ in a random
geometric graph with k simple random walks, the mean
inter-packet time is given by

\[ E[T_{packet}(u)] = \frac{E[T_{visit}(u)]}{k} = \frac{\mu n}{k \, d_n(u)}, \tag{18} \]

where $\mu$ is the mean degree of the graph given by (2).



From Lemma 10 and Lemma 12, it is easy to see that for
any node u, an estimate of k can be obtained by

\[ \hat{k}(u) = \frac{E[T_{visit}(u)]}{E[T_{packet}(u)]}. \tag{19} \]
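The estimates (17) and (19) can be tried out in a small simulation. The sketch below makes several assumptions that are not in the paper: the k walks move in synchronized rounds on a torus grid (a regular graph, so $d_n(u) = \mu$ and the approximation in (17) is exact in expectation), empirical averages replace the expectations, and the names `estimate_n_k` and `torus_neighbors` are hypothetical.

```python
import random


def torus_neighbors(side):
    """Adjacency lists of a side x side torus grid (a stand-in topology)."""
    def node(r, c):
        return (r % side) * side + (c % side)
    return [[node(r - 1, c), node(r + 1, c), node(r, c - 1), node(r, c + 1)]
            for r in range(side) for c in range(side)]


def estimate_n_k(neighbors, starts, observer, rounds, seed=0):
    """Estimate (n, k) at `observer` from visit times of len(starts) random walks."""
    rng = random.Random(seed)
    position = list(starts)
    visits = [[] for _ in starts]  # visit times at the observer, per walk
    for t in range(1, rounds + 1):
        for w in range(len(position)):
            position[w] = rng.choice(neighbors[position[w]])
            if position[w] == observer:
                visits[w].append(t)
    # Mean inter-visit time, averaged over walks that came back at least once.
    gaps = [(v[-1] - v[0]) / (len(v) - 1) for v in visits if len(v) > 1]
    t_visit = sum(gaps) / len(gaps)
    # Mean inter-packet time: time between consecutive arrivals of any walk.
    merged = sorted(t for v in visits for t in v)
    t_packet = (merged[-1] - merged[0]) / (len(merged) - 1)
    return t_visit, t_visit / t_packet


neighbors = torus_neighbors(6)  # n = 36
n_hat, k_hat = estimate_n_k(neighbors, starts=[0, 7, 14, 21, 28],
                            observer=3, rounds=20000, seed=1)
print("estimate of n:", n_hat, "estimate of k:", k_hat)
```

On a graph with unequal degrees, the first estimate would be off by the factor $\mu / d_n(u)$ from Lemma 10, which is the price of the approximation in (17).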



After obtaining estimates for both n and k, we can em- 
ploy similar techniques used in LTCDS-I to do LT coding 
and storage. The detailed descriptions of the initialization, 
inference, encoding, and storage phases of LTCDS-II algo- 
rithm are given below: 

(i) Initialization Phase:

(1) Each source node $s_i$, i = 1, ..., k, generates a
header for its source packet $x_{s_i}$ and puts its ID
and a counter $c(x_{s_i})$ with initial value zero into
the packet header.

(2) Each source node $s_i$ sends out its own source
packet $x_{s_i}$ to one of its neighbors u, chosen uniformly
at random among all its neighbors $\mathcal{N}(s_i)$.

(3) The node u puts $x_{s_i}$ into its forward queue and
sets the counter of $x_{s_i}$ to

\[ c(x_{s_i}) = 1. \tag{20} \]



(ii) Inference Phase:

(1) For each node u, suppose $x_{s(u)_1}$ is the first source
packet that visits u, and denote by $t^{(j)}_{s(u)_1}$ the time
when $x_{s(u)_1}$ has its j-th visit to the node u. Meanwhile,
each node u also maintains a record of the
visiting times of every other source packet $x_{s(u)_i}$
that has visited it. Let $t^{(j)}_{s(u)_i}$ be the time when source
packet $x_{s(u)_i}$ has its j-th visit to the node u. After
$x_{s(u)_1}$ has visited the node u $C_2$ times, where $C_2$ is a
system parameter which is a positive constant, the
node u stops this monitoring and recording procedure.
Denote by k(u) the number of source packets
that have visited u at least once up to that time.

(2) For each node u, let $J(s(u)_i)$ be the number of
visits of source packet $x_{s(u)_i}$ to the node u and let

\[
\begin{aligned}
T_{s(u)_i} &= \frac{1}{J(s(u)_i)} \sum_{j=1}^{J(s(u)_i)-1} \left( t^{(j+1)}_{s(u)_i} - t^{(j)}_{s(u)_i} \right) && (21) \\
&= \frac{1}{J(s(u)_i)} \left( t^{(J(s(u)_i))}_{s(u)_i} - t^{(1)}_{s(u)_i} \right). && (22)
\end{aligned}
\]

Then, the average inter-visit time for node u is
given by

\[ \bar{T}_{visit}(u) = \frac{1}{k(u)} \sum_{i=1}^{k(u)} T_{s(u)_i}. \tag{23} \]

Let $J_{\min} = \min_{s(u)_i} \{ t^{(1)}_{s(u)_i} \}$ and
$J_{\max} = \max_{s(u)_i} \{ t^{(J(s(u)_i))}_{s(u)_i} \}$; then the inter-packet time
is given by

\[ \bar{T}_{packet}(u) = \frac{J_{\max} - J_{\min}}{\sum_{s(u)_i} J(s(u)_i)}. \tag{24} \]

Then the node u can estimate the total number
of nodes in the network and the total number of
sources as

\[ \hat{n}(u) = \bar{T}_{visit}(u), \tag{25} \]

and

\[ \hat{k}(u) = \frac{\bar{T}_{visit}(u)}{\bar{T}_{packet}(u)}. \tag{26} \]

(3) In this phase, the counter $c(x_{s_i})$ of each source
packet $x_{s_i}$ is incremented by one after each
transmission.

(iii) Encoding Phase:

When a node u obtains the estimates $\hat{n}(u)$ and $\hat{k}(u)$, it
begins the encoding phase, which is the same as the one in
the LTCDS-I algorithm except that the code degree $d_c(u)$
is drawn from the distribution $\Omega_{is}(d)$ (or $\Omega_{rs}(d)$) with
k replaced by $\hat{k}(u)$, and a source packet $x_{s_i}$ is
discarded if $c(x_{s_i}) \ge C_3 \hat{n}(u) \log \hat{n}(u)$, where $C_3$ is a
system parameter which is a positive constant.

(iv) Storage Phase:

When a node u has made its decisions for $\hat{k}(u)$ source
packets, it finishes its encoding process and $y_u$ becomes
the storage packet of u.

The total number of transmissions (the total number of
steps of k random walks) in the LTCDS-II algorithm has
the same order as in LTCDS-I.

Theorem 13. Denote by $T^{(II)}_{LTCDS}$ the total number of
transmissions of the LTCDS-II algorithm; then we have

\[ T^{(II)}_{LTCDS} = \Theta(k n \log n), \tag{27} \]

where k is the total number of sources, and n is the total
number of nodes in the network.

Proof. In the inference phase of the LTCDS-II algorithm,
the total number of transmissions is upper bounded by $C' n$ for
some constant $C' > 0$. That is because each node needs
to receive the first visiting source packet $C_2$ times, and by
Lemma 10, the mean inter-visit time is $O(n)$.

In the encoding phase, as in the LTCDS-I algorithm,
in order to guarantee that each source packet visits
all the nodes at least once, the number of steps of the simple
random walk is $\Theta(n \log n)$. In other words, each source
packet is stopped and discarded if and only if the counter
reaches the threshold $C_3 n \log n$ for some system parameter
$C_3$. Therefore, we have (27). □

4.3 Updating Data

Now we turn our attention to updating data after all storage nodes have saved their values $y_1, y_2, \ldots, y_n$, when a sensor node, say $s_i$, wants to update its value at the appropriate set of storage nodes in the network. The following updating algorithm applies to both LTCDS-I and LTCDS-II. For simplicity, we illustrate the idea with LTCDS-I.

Assume the sensor node has prepared a packet with its ID, the old data $x_{s_i}$, and the new data $x'_{s_i}$, along with a time-to-live parameter $c(s_i)$ initialized to zero. We again use a simple random walk for the data update.

$$\mathrm{packet}_{s_i} = (ID_{s_i},\; x_{s_i} \oplus x'_{s_i},\; c(s_i)). \tag{28}$$

If we assume that the storage nodes keep the IDs of the accepted packets, then the problem becomes simple: we just run a random walk and check the ID of each incoming packet. Assume the node u keeps track of the IDs of all its accepted packets. Then u accepts the update message if the ID of the incoming packet is already in its ID list. Otherwise, u forwards the packet and increments the time-to-live counter. If this counter reaches the threshold value, the packet is discarded.

The following steps describe the update scenario: 

(i) Preparation Phase:

The node $s_i$ prepares its new packet with the new and old data along with its ID and counter. In addition, $s_i$ adds an update counter token, initialized to 1 for the first updated packet. We assume that the following steps take place when the token is set to 1.

$$\mathrm{packet}_{s_i} = (ID_{s_i},\; x_{s_i} \oplus x'_{s_i},\; c(s_i)). \tag{29}$$

$s_i$ chooses one of its neighbor nodes u at random and sends it $\mathrm{packet}_{s_i}$.

(ii) Encoding Phase:

The node u checks whether $\mathrm{packet}_{s_i}$ is an update or a first-time packet. If it is a first-time packet, u accepts, forwards, or discards it as in the LTCDS-I algorithm. If $\mathrm{packet}_{s_i}$ is an update packet, then the node u checks whether $ID_{s_i}$ is already included in its accepted list. If yes, then it updates its value $y_u$ as follows:

$$y_u \leftarrow y_u \oplus (x_{s_i} \oplus x'_{s_i}). \tag{30}$$

If no, it adds this update packet to its forward queue and increments the counter:

$$c(x'_{s_i}) = c(x'_{s_i}) + 1. \tag{31}$$

The packet $\mathrm{packet}_{s_i}$ is discarded if $c(x'_{s_i}) > C_1 n \log n$, where $C_1$ is a system parameter. In this case, we need $C_1$ to be large enough so that all old data $x_{s_i}$ are updated to the new data $x'_{s_i}$.

(iii) Storage Phase:

Once all nodes have finished updating their values $y_i$, one can run the decoding phase to retrieve the original and updated information.
Now, since we run only one simple random walk for each update, if h is the number of nodes updating their values, then we have the following result.

Lemma 14. The total number of transmissions needed for the update process is bounded by $\Theta(hn \log n)$.
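
As an illustration, the update procedure above can be sketched in Python. This is a minimal sketch under assumptions that the text does not state: payloads are equal-length byte strings, a node that applies an update still forwards the packet (otherwise only one storage node would be refreshed), the logarithm is natural, and each node remembers the newest update token it has applied per source so that a revisit does not XOR the same difference twice. The names `StorageNode`, `on_update`, and `run_update_walk` are hypothetical.

```python
import math
import random


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length payloads."""
    return bytes(p ^ q for p, q in zip(a, b))


class StorageNode:
    """One storage node u holding a coded packet y_u."""

    def __init__(self, payload_len: int):
        self.y = bytes(payload_len)   # storage packet y_u (all zeros at start)
        self.accepted_ids = set()     # IDs of source packets XOR-ed into y_u
        self.last_token = {}          # newest update token applied, per source

    def accept(self, source_id, x: bytes) -> None:
        """First-time packet accepted during encoding: fold x into y_u."""
        self.accepted_ids.add(source_id)
        self.y = xor_bytes(self.y, x)

    def on_update(self, source_id, delta: bytes, token: int) -> None:
        """Apply delta = x XOR x' once per token if this node holds the source."""
        if source_id not in self.accepted_ids:
            return
        if self.last_token.get(source_id, 0) >= token:
            return  # assumption: the token guards against applying delta twice
        self.y = xor_bytes(self.y, delta)
        self.last_token[source_id] = token


def run_update_walk(nodes, neighbors, start, source_id, delta, token, c1, rng):
    """Run one simple random walk carrying the update packet.

    The walk stops when the time-to-live counter reaches ceil(c1 * n * ln n).
    Returns the number of transmissions used.
    """
    n = len(nodes)
    ttl = math.ceil(c1 * n * math.log(n))
    current, counter = start, 0
    while counter < ttl:
        current = rng.choice(neighbors[current])  # forward to a random neighbor
        counter += 1                              # one transmission
        nodes[current].on_update(source_id, delta, token)
    return counter
```

Because the packet carries only the difference $x_{s_i} \oplus x'_{s_i}$, XOR-ing it into $y_u$ cancels the old value and leaves the new one in its place, without touching the other source packets combined in $y_u$.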

5 Performance Evaluation 

In this section, we study performance of the proposed 
LTCDS-I and LTCDS-II algorithms for distributed storage 
in wireless sensor networks through simulation. The main 
performance metric we investigate is the successful decod- 
ing probability versus the decoding ratio. 

Definition 15. (Decoding Ratio) The decoding ratio $\eta$ is the ratio between the number of queried nodes h and the number of sources k, i.e.,

$$\eta = \frac{h}{k}. \tag{32}$$

Definition 16. (Successful Decoding Probability) The successful decoding probability $P_s$ is the probability that the k source packets are all recovered from the h queried nodes.

In our simulation, $P_s$ is evaluated as follows. Suppose the network has n nodes and k sources, and we query h nodes. There are $\binom{n}{h}$ ways to choose such h nodes, and we pick one tenth of these choices uniformly at random:

$$M = \frac{1}{10}\binom{n}{h} = \frac{1}{10} \cdot \frac{n!}{h!\,(n-h)!}. \tag{33}$$

Let $M_s$ be the number of these M choices of h query nodes from which the k source packets can be recovered. Then, we evaluate the successful decoding probability as

$$P_s = \frac{M_s}{M}. \tag{34}$$
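
The evaluation of $P_s$ can be sketched as a Monte Carlo loop. The sketch below rests on several assumptions that are not taken from the text: each storage packet is represented only by the set of source indices XOR-ed into it, decoding uses a peeling (belief-propagation) decoder of the kind commonly used for LT codes, and a fixed number of `trials` independently drawn query sets stands in for the M choices of (33), which cannot be enumerated for realistic n and h. The function names are hypothetical.

```python
import random


def peel_decodable(packets, k):
    """Return True if a peeling decoder recovers all k source packets.

    packets is a list of sets; each set holds the indices of the source
    packets that were XOR-ed into one queried storage packet.
    """
    remaining = [set(p) for p in packets]
    recovered = set()
    # The ripple holds sources that appear alone in some packet.
    ripple = [next(iter(p)) for p in remaining if len(p) == 1]
    while ripple:
        s = ripple.pop()
        if s in recovered:
            continue
        recovered.add(s)
        # Subtract the recovered source from every packet that contains it.
        for p in remaining:
            if s in p:
                p.discard(s)
                if len(p) == 1:
                    ripple.append(next(iter(p)))
    return len(recovered) == k


def estimate_success_probability(stored, k, h, trials, rng):
    """Estimate P_s = M_s / M from `trials` random choices of h query nodes."""
    successes = 0
    for _ in range(trials):
        queried = rng.sample(range(len(stored)), h)
        if peel_decodable([stored[u] for u in queried], k):
            successes += 1
    return successes / trials
```

Sampling query sets independently, rather than drawing M distinct subsets, is a simplification; for large $\binom{n}{h}$ the difference is negligible because repeated subsets are very unlikely.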

Figure 3 shows the decoding performance of the LTCDS-I algorithm with the Ideal Soliton distribution for a small number of nodes and sources. The network is deployed in $A = [5, 5]^2$, and the system parameter $C_1$ is set as $C_1 = 5$. From the simulation results we can see that when the decoding ratio is above 2, the successful decoding probability is about 99%. Another observation is that when the total number of nodes increases but the ratio between k and n and the decoding ratio $\eta$ are kept constant, the successful decoding probability $P_s$ increases when $\eta > 1.5$ and decreases when $\eta < 1.5$. This is also confirmed by the results shown in Figure 4. In Figure 4, the network has constant density $\lambda$ and the system parameter $C_1 = 3$.
Figure 3. Decoding performance of LTCDS-I algorithm with small number of nodes and sources

Figure 4. Decoding performance of LTCDS-I algorithm with medium number of nodes and sources

In Figure 5, we fix the decoding ratio $\eta$ as 1.4 and 1.7, respectively, and fix the ratio between the number of sources and the number of nodes as 10%, i.e., k/n = 0.1, and change the number of nodes n from 500 to 5000. From the results, it can be seen that as n grows, the successful decoding probability increases until it reaches a plateau, which is the successful decoding probability of real LT codes. This confirms that the LTCDS-I algorithm has the same asymptotic performance as LT codes.

Figure 5. Decoding performance of LTCDS-I algorithm with different number of nodes

Figure 6. Decoding performance of LTCDS-I algorithm with different system parameter $C_1$

Figure 7. Decoding performance of LTCDS-II algorithm with small number of nodes and sources

Figure 8. Decoding performance of LTCDS-II algorithm with medium number of nodes and sources

To investigate how the system parameter $C_1$ affects the decoding performance of the LTCDS-I algorithm, we fix the decoding ratio $\eta$ and change $C_1$. The simulation results are shown in Figure 6. For the scenario of 1000 nodes and 100 sources, $\eta$ is set as 1.6, and for the scenario of 500 nodes and 50 sources, $\eta$ is set as 1.8. The code degree distribution is also the Ideal Soliton distribution, and the network is deployed in $A = [15, 15]^2$. It can be seen that when $C_1 > 3$, $P_s$ remains almost constant, which indicates that after $3n \log n$ steps, almost all source packets have visited each node at least once.
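
The observation that roughly $3n \log n$ steps suffice can be probed empirically. The sketch below simulates one simple random walk for $\lceil C_1 n \log n \rceil$ steps on a graph given as an adjacency list and reports the fraction of nodes visited. It is only an illustration: the adjacency-list input, the natural logarithm, counting the start node as visited, and the name `coverage_fraction` are all assumptions, and the graph is supplied by the caller rather than generated as in the simulations reported here.

```python
import math
import random


def coverage_fraction(neighbors, start, c1, rng):
    """Fraction of nodes visited by one simple random walk.

    neighbors maps each node to the list of its neighbors; the walk takes
    ceil(c1 * n * ln n) steps, mirroring the counter threshold C_1 n log n.
    """
    n = len(neighbors)
    steps = math.ceil(c1 * n * math.log(n))
    visited = {start}
    current = start
    for _ in range(steps):
        current = rng.choice(neighbors[current])  # uniform over neighbors
        visited.add(current)
    return len(visited) / n
```

Averaging `coverage_fraction` over many walks and several values of `c1` gives a rough picture of where coverage saturates for a particular topology.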

Figure 7 compares the decoding performance of LTCDS-II and LTCDS-I with the Ideal Soliton distribution for a small number of nodes and sources. As in Figure 3, the network is deployed in $A = [5, 5]^2$, and the system parameter is set as $C_3 = 10$. To guarantee that each node obtains accurate estimates of n and k, we set $C_2 = 50$. It can be seen that the decoding performance of the LTCDS-II algorithm is slightly worse than that of the LTCDS-I algorithm when the decoding ratio $\eta$ is small, and almost the same when $\eta$ is large. Figure 8 compares the decoding performance of LTCDS-II and LTCDS-I with the Ideal Soliton distribution for a medium number of nodes and sources, where the network has constant density $\lambda$ and the system parameter $C_3 = 20$. Here we observe a different phenomenon: the decoding performance of the LTCDS-II algorithm is slightly better than that of the LTCDS-I algorithm when the decoding ratio $\eta$ is small, and almost the same when $\eta$ is large. That is because for the simulation in Figure 8 we set $C_3 = 20$, which is larger than the $C_3 = 10$ set for the simulation in Figure 7. The larger value of $C_3$ guarantees that each node has the chance to accept each source packet, which results in a more uniform distribution.

Figure 9. Estimation results in LTCDS-II algorithm with n = 200 nodes and k = 20 sources: (a) estimations of n; (b) estimations of k.

Figure 10. Estimation results in LTCDS-II algorithm with n = 1000 nodes and k = 100 sources: (a) estimations of n; (b) estimations of k.

Figures 9 and 10 show histograms of the estimates of n and k at each node for two scenarios: Figure 9 shows the results for 200 nodes and 20 sources, and Figure 10 shows the results for 1000 nodes and 100 sources. In both scenarios, we set $C_2 = 50$. From the results we can see that the estimates of k are more accurate and more concentrated than the estimates of n. This is because the estimate of k depends only on the ratio between the expected inter-visit time and the expected inter-packet time, which is independent of the mean degree $\mu$ and the node degree $d_n(u)$. On the other hand, the estimate of n actually depends on $\mu$ and $d_n(u)$. However, in the LTCDS-II algorithm, each node approximates $\mu$ by its own node degree $d_n(u)$, which causes the deviation of the estimates of n.
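
To make the dependence on the two observed times concrete, the estimators of (25) and (26) can be sketched as follows. This is a hedged illustration rather than the exact procedure: it averages over the gaps between consecutive visits (dividing by the number of gaps, not the number of visits), skips packets seen only once when computing the inter-visit time, and takes the inter-packet time as the observation window divided by the number of gaps between all recorded visits. The function name and the input format are hypothetical.

```python
import math


def estimate_n_and_k(visit_times):
    """Return (n_hat, k_hat) computed from one node's visit log.

    visit_times maps a source ID to the sorted list of times at which that
    source packet visited this node.
    """
    per_source_means = []
    total_visits = 0
    first_seen, last_seen = math.inf, -math.inf
    for times in visit_times.values():
        if not times:
            continue
        total_visits += len(times)
        first_seen = min(first_seen, times[0])
        last_seen = max(last_seen, times[-1])
        if len(times) >= 2:
            # Mean gap between consecutive visits of this packet.
            per_source_means.append((times[-1] - times[0]) / (len(times) - 1))
    if not per_source_means or last_seen == first_seen:
        raise ValueError("need at least one packet observed twice")
    t_visit = sum(per_source_means) / len(per_source_means)
    # Mean gap between consecutive visits of any packet.
    t_packet = (last_seen - first_seen) / (total_visits - 1)
    n_hat = t_visit              # relies on approximating mu by d_n(u)
    k_hat = t_visit / t_packet   # the degree terms cancel in this ratio
    return n_hat, k_hat
```

For example, a log in which two packets each return every 10 time units, offset by 5 units, yields an inter-packet time of 5 and therefore an estimate of two sources.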

To investigate how the system parameter $C_2$ affects the decoding performance of the LTCDS-II algorithm, we fix the decoding ratio $\eta$ and $C_3$, and change $C_2$. The simulation results are shown in Figure 11. From the simulation results, we can see that when $C_2$ is chosen to be small, the performance of the LTCDS-II algorithm is very poor. This is due to the inaccurate estimates of k and n at each node. When $C_2$ is large, for example, when $C_2 > 30$, the performance remains almost the same as $C_2$ grows.

Figure 11. Decoding performance of LTCDS-II algorithm with different system parameter $C_2$



6 Conclusion 



In this paper, we studied a model for large-scale wireless sensor networks, where the network nodes have low CPU power and limited storage. We proposed two new decentralized algorithms that utilize Fountain codes and random walks to distribute information sensed by k sensing source nodes to n storage nodes. These algorithms are simpler, more robust, and less constrained than previous solutions that require knowledge of the network topology, the maximum degree of a node, or the values of n and k [6], [9], [10], [11]. We analyzed the encoding and decoding complexity of these algorithms and simulated their performance with small and large values of k and n. We showed that a node can successfully estimate the number of sources and the total number of nodes as long as it can compute the inter-visit time and the inter-packet time.

Our future work will include Raptor-code-based distributed networked storage algorithms for sensor networks. We also plan to provide theoretical results and proofs for the results shown in this paper in a setting where limited space is not an issue. Our algorithm for estimating the values of n and k is promising; we plan to investigate other network models where this algorithm is beneficial and can be utilized.



Acknowledgments 

The authors would like to thank the reviewers for their 
comments. They would like to express their gratitude to all 
Bell Labs & Alcatel-Lucent staff members for their hospi- 
tality and kindness. 

7 Appendix 

7.1 Proof of Lemma 10

Proof. For a simple random walk on an undirected graph $G = (V, E)$, the stationary distribution is given by [1], [17]

$$\pi(u) = \frac{d_n(u)}{2|E|}. \tag{35}$$

On the other hand, for a reversible Markov chain, the expected return time for a state i is given by [1], [17], [14]

$$\frac{1}{\pi(i)}, \tag{36}$$

where $\pi(i)$ is the stationary distribution of state i.

From (35) and (36), for a simple random walk on a graph, the expected inter-visit time of node u is

$$E[T_{\mathrm{visit}}(u)] = \frac{2|E|}{d_n(u)} = \frac{\mu n}{d_n(u)}, \tag{37}$$

where $\mu$ is the mean degree of the graph. □
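
The two facts used in this proof can be checked exactly on a small example. The sketch below uses exact rational arithmetic to verify that the degree-proportional distribution is stationary for the simple random walk on a given connected undirected graph, and then evaluates the expected return time as its reciprocal. The adjacency-list format and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def stationary_from_degrees(adj):
    """pi(u) = d(u) / (2|E|) for an undirected graph given as adjacency lists."""
    two_e = sum(len(nb) for nb in adj.values())  # sum of degrees equals 2|E|
    return {u: Fraction(len(nb), two_e) for u, nb in adj.items()}


def is_stationary(adj, pi):
    """Check pi P = pi, where P moves to a uniformly chosen neighbor."""
    for v in adj:
        inflow = sum(pi[u] / len(adj[u]) for u in adj if v in adj[u])
        if inflow != pi[v]:
            return False
    return True


def expected_return_time(adj, u):
    """Expected return time of node u, computed as 1 / pi(u) = 2|E| / d(u)."""
    return 1 / stationary_from_degrees(adj)[u]
```

On a four-node graph with edges 0-1, 1-2, 1-3 and 2-3, we have $|E| = 4$ and $\mu = 2$, so the node of degree 3 has expected return time $8/3 = \mu n / d_n(u)$.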

7.2 Proof of Lemma I 



Proof. For a given node u and k simple random walks, each simple random walk has expected inter-visit time $\frac{\mu n}{d_n(u)}$. We now view this process from another perspective: we assume there are k nodes $\{v_1, \ldots, v_k\}$ uniformly distributed in the network and an agent from node u follows a simple random walk. Then the expected inter-visit time for this agent to visit any particular $v_i$ is the same, $\frac{\mu n}{d_n(u)}$. However, the expected inter-visit time for any two nodes $v_i$ and $v_j$ is $\frac{\mu n}{k\, d_n(u)}$, which gives the expected inter-packet time. □
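
Combining the two means shows why the ratio in (26) estimates k. The following short derivation is a heuristic: it takes the inter-visit mean $\mu n / d_n(u)$ from (37), assumes the inter-packet mean is $\mu n / (k\, d_n(u))$, and treats the sample averages as if they equaled their expectations.

```latex
\frac{E[T_{\mathrm{visit}}(u)]}{E[T_{\mathrm{packet}}(u)]}
  = \frac{\mu n / d_n(u)}{\mu n / \bigl(k\, d_n(u)\bigr)}
  = k,
\qquad
E[T_{\mathrm{visit}}(u)] = \frac{\mu n}{d_n(u)} \approx n
  \quad \text{when } d_n(u) \approx \mu .
```

The factors $\mu$ and $d_n(u)$ cancel in the ratio, whereas the estimate of n is accurate only to the extent that the degree of u is close to the mean degree.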